Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods

نویسندگان

  • Davy Weissenbacher
  • Abeed Sarker
  • Tasnia Tahsin
  • Matthew Scotch
  • Graciela Gonzalez
چکیده

The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise- reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Simple Queries as Distant Labels for Predicting Gender on Twitter

The majority of research on extracting missing user attributes from social media profiles use costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered using external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confir...

متن کامل

Distant Supervision for Cancer Pathway Extraction from Text

Biological pathways are central to understanding complex diseases such as cancer. The majority of this knowledge is scattered in the vast and rapidly growing research literature. To automate knowledge extraction, machine learning approaches typically require annotated examples, which are expensive and time-consuming to acquire. Recently, there has been increasing interest in leveraging database...

متن کامل

Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision

Systematic reviews underpin Evidence Based Medicine (EBM) by addressing precise clinical questions via comprehensive synthesis of all relevant published evidence. Authors of systematic reviews typically define a Population/Problem, Intervention, Comparator, and Outcome (a PICO criteria) of interest, and then retrieve, appraise and synthesize results from all reports of clinical trials that meet...

متن کامل

Extracting microRNA-gene relations from biomedical literature using distant supervision

Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological process...

متن کامل

Distantly supervised Web relation extraction for knowledge base population

Extracting information from Web pages for populating large, cross-domain knowledge bases requires methods which are suitable across domains, do not require manual effort to adapt to new domains, are able to deal with noise, and integrate information extracted from different Web pages. Recent approaches have used existing knowledge bases to learn to extract information with promising results, on...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 2017  شماره 

صفحات  -

تاریخ انتشار 2017